Computation and Language 58
☆ OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Ningyu Zhang, Jiang Yong, Pengjun Xie, Fei Huang, Huajun Chen
Machine writing with large language models often relies on
retrieval-augmented generation. However, these approaches remain confined
within the boundaries of the model's predefined scope, limiting the generation
of content with rich information. Specifically, vanilla-retrieved information
tends to lack depth and utility, and suffers from redundancy, which negatively
impacts the quality of generated articles, leading to shallow, repetitive, and
unoriginal outputs. To address these issues, we propose OmniThink, a machine
writing framework that emulates the human-like process of iterative expansion
and reflection. The core idea behind OmniThink is to simulate the cognitive
behavior of learners as they progressively deepen their knowledge of the
topics. Experimental results demonstrate that OmniThink improves the knowledge
density of generated articles without compromising metrics such as coherence
and depth. Human evaluations and expert feedback further highlight the
potential of OmniThink to address real-world challenges in the generation of
long-form articles.
☆ Enhancing Lexicon-Based Text Embeddings with Large Language Models
Recent large language models (LLMs) have demonstrated exceptional performance
on general-purpose text embedding tasks. While dense embeddings have dominated
related research, we introduce the first Lexicon-based EmbeddiNgS (LENS)
leveraging LLMs that achieve competitive performance on these tasks. Regarding
the inherent tokenization redundancy issue and unidirectional attention
limitations in traditional causal LLMs, LENS consolidates the vocabulary space
through token embedding clustering, and investigates bidirectional attention
and various pooling strategies. Specifically, LENS simplifies lexicon matching
by assigning each dimension to a specific token cluster, where semantically
similar tokens are grouped together, and unlocks the full potential of LLMs
through bidirectional attention. Extensive experiments demonstrate that LENS
outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB),
delivering compact feature representations that match the sizes of dense
counterparts. Notably, combining LENS with dense embeddings achieves
state-of-the-art performance on the retrieval subset of MTEB (i.e., BEIR).
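As a rough illustration of the clustering idea described above, the sketch
below clusters token embeddings with k-means and max-pools per-token weights
into one value per cluster, so each output dimension corresponds to a token
cluster. All array shapes, the pooling choice, and the weight source are
assumptions for illustration; the paper's actual pipeline derives the weights
from the LLM and explores several pooling strategies.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2000, 64))  # stand-in for the LLM's token embeddings
token_weights = rng.random(2000)                # stand-in for per-token relevance scores

# Consolidate the vocabulary: semantically similar tokens share a cluster.
km = KMeans(n_clusters=200, n_init=10, random_state=0).fit(token_embeddings)

# One dimension per cluster; max-pool the weights of the tokens in each cluster.
lexicon_embedding = np.zeros(200)
np.maximum.at(lexicon_embedding, km.labels_, token_weights)
print(lexicon_embedding.shape)  # (200,) -- a compact lexicon-based representation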
☆ Suggesting Code Edits in Interactive Machine Learning Notebooks Using Large Language Models
Machine learning developers frequently use interactive computational
notebooks, such as Jupyter notebooks, to host code for data processing and
model training. Jupyter notebooks provide a convenient tool for writing machine
learning pipelines and interactively observing outputs; however, maintaining
Jupyter notebooks, e.g., to add new features or fix bugs, can be challenging
due to the length and complexity of the notebooks. Moreover, there is no
existing benchmark related to developer edits on Jupyter notebooks. To address
this, we present the first dataset of 48,398 Jupyter notebook edits derived
from 20,095 revisions of 792 machine learning repositories on GitHub, and
perform the first study of using LLMs to predict code edits in Jupyter
notebooks. Our dataset captures granular details of cell-level and line-level
modifications, offering a foundation for understanding real-world maintenance
patterns in machine learning workflows. We observed that the edits on Jupyter
notebooks are highly localized, with changes averaging only 166 lines of code
in repositories. While larger models outperform smaller counterparts in code
editing, all models have low accuracy on our dataset even after finetuning,
demonstrating the complexity of real-world machine learning maintenance tasks.
Our findings emphasize the critical role of contextual information in improving
model performance and point toward promising avenues for advancing large
language models' capabilities in engineering machine learning code.
☆ Attention based Bidirectional GRU hybrid model for inappropriate content detection in Urdu language
With the increased use of the internet and social networks for online
discussions, the spread of toxic and inappropriate content on social networking
sites has also increased. Several studies have been conducted in different
languages. However, there is less work done for South Asian languages for
inappropriate content identification using deep learning techniques. In the
Urdu language, spellings are not unique: people write different common
spellings for the same word, and mixing in other languages, such as English,
makes the text more challenging to process; moreover, limited research is
available on processing such language with state-of-the-art algorithms. An
attention layer combined with a deep learning model can help handle long-term
dependencies and increase efficiency. To explore the effects of the attention
layer, this study proposes an attention-based Bidirectional GRU hybrid model
for identifying inappropriate content in Urdu Unicode text. Four baseline deep
learning models (LSTM, Bi-LSTM, GRU, and TCN) are used to compare the
performance of the proposed model. The results of these models were compared
based on evaluation metrics, dataset size, and the impact of the word
embedding layer. Pre-trained Urdu word2Vec embeddings were used in our
experiments. Our proposed model, BiGRU-A, outperformed all baseline models,
yielding 84% accuracy without using the pre-trained word2Vec layer. From our
experiments, we have established that the attention layer improves the model's
efficiency and that pre-trained word2Vec embeddings do not work well with the
inappropriate content dataset.
☆ Comparative Insights from 12 Machine Learning Models in Extracting Economic Ideology from Political Text
This study conducts a systematic assessment of the capabilities of 12 machine
learning models and model variations in detecting economic ideology. As an
evaluation benchmark, I use manifesto data spanning six elections in the United
Kingdom and pre-annotated by expert and crowd coders. The analysis assesses the
performance of several generative, fine-tuned, and zero-shot models at the
granular and aggregate levels. The results show that generative models such as
GPT-4o and Gemini 1.5 Flash consistently outperform other models against all
benchmarks. However, they pose issues of accessibility and resource
availability. Fine-tuning yields competitive performance and offers a reliable
alternative through domain-specific optimization, but its dependence on
training data severely limits scalability. Zero-shot models consistently face
difficulties with identifying signals of economic ideology, often resulting in
negative associations with human coding. Using general knowledge for the
domain-specific task of ideology scaling proved to be unreliable. Other key
findings include considerable within-party variation, fine-tuning benefiting
from larger training data, and zero-shot's sensitivity to prompt content. The
assessment covers the strengths and limitations of each model and derives
best practices for automated analyses of political content.
☆ Domain Adaptation of Foundation LLMs for e-Commerce
Christian Herold, Michael Kozielski, Tala Bazazo, Pavel Petrushkov, Hadi Hashemi, Patrycja Cieplicka, Dominika Basaj, Shahram Khadivi
We present the e-Llama models: 8 billion and 70 billion parameter large
language models that are adapted to the e-commerce domain. These models are
meant as foundation models with deep knowledge of e-commerce that form a base
for instruction tuning and fine-tuning. The e-Llama models are obtained by
continued pretraining of the Llama 3.1 base models on 1 trillion tokens of
domain-specific data.
We discuss our approach and motivate our choice of hyperparameters with a
series of ablation studies. To quantify how well the models have been adapted
to the e-commerce domain, we define and implement a set of multilingual,
e-commerce specific evaluation tasks.
We show that, when carefully choosing the training setup, the Llama 3.1
models can be adapted to the new domain without significant loss of
performance on general-domain tasks. We also explore the possibility of
merging the adapted model and the base model for better control of the
performance trade-off between domains.
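The abstract does not spell out the merging procedure; a common baseline is
linear interpolation of the two models' weights, sketched below with
hypothetical names. The coefficient alpha would control the in-domain versus
general-domain trade-off.

import torch

def merge_state_dicts(base_sd, adapted_sd, alpha=0.5):
    # alpha * adapted + (1 - alpha) * base for every shared parameter tensor
    # (assumes both models share the same architecture and parameter names).
    return {name: alpha * adapted_sd[name] + (1.0 - alpha) * base_sd[name]
            for name in base_sd}

# Usage (model identifiers are illustrative, not the released checkpoints):
# base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
# adapted = AutoModelForCausalLM.from_pretrained("e-llama-8b")
# base.load_state_dict(merge_state_dicts(base.state_dict(), adapted.state_dict(), alpha=0.7))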
☆ Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li
Language has long been conceived as an essential tool for human reasoning.
The breakthrough of Large Language Models (LLMs) has sparked significant
research interest in leveraging these models to tackle complex reasoning tasks.
Researchers have moved beyond simple autoregressive token generation by
introducing the concept of "thought" -- a sequence of tokens representing
intermediate steps in the reasoning process. This innovative paradigm enables
LLMs to mimic complex human reasoning processes, such as tree search and
reflective thinking. Recently, an emerging trend of learning to reason has
applied reinforcement learning (RL) to train LLMs to master reasoning
processes. This approach enables the automatic generation of high-quality
reasoning trajectories through trial-and-error search algorithms, significantly
expanding LLMs' reasoning capacity by providing substantially more training
data. Furthermore, recent studies demonstrate that encouraging LLMs to "think"
with more tokens during test-time inference can further significantly boost
reasoning accuracy. Therefore, train-time and test-time scaling combine to
chart a new research frontier -- a path toward Large Reasoning Models. The
introduction of OpenAI's o1 series marks a significant milestone in this
research direction. In this survey, we present a comprehensive review of recent
progress in LLM reasoning. We begin by introducing the foundational background
of LLMs and then explore the key technical components driving the development
of large reasoning models, with a focus on automated data construction,
learning-to-reason techniques, and test-time scaling. We also analyze popular
open-source projects aimed at building large reasoning models, and conclude with open
challenges and future research directions.
comment: 36 pages, 5 figures
☆ The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models
The recent rise in the popularity of large language models has spurred the
development of extensive code datasets needed to train them. This has left
limited code available for collection and use in downstream investigations of
specific behaviors, or for evaluating large language models without suffering
from data contamination. To address this problem, we release The Heap, a large
multilingual dataset covering 57 programming languages that has been
deduplicated with respect to other open datasets of code, enabling researchers
to conduct fair evaluations of large language models without significant data
cleaning overhead.
comment: Pre-Print. Accepted to FORGE 2025 Dataset Track
☆ CarMem: Enhancing Long-Term Memory in LLM Voice Assistants through Category-Bounding COLING 2025
In today's assistant landscape, personalisation enhances interactions,
fosters long-term relationships, and deepens engagement. However, many systems
struggle with retaining user preferences, leading to repetitive user requests
and disengagement. Furthermore, the unregulated and opaque extraction of user
preferences in industry applications raises significant concerns about privacy
and trust, especially in regions with stringent regulations like Europe. In
response to these challenges, we propose a long-term memory system for voice
assistants, structured around predefined categories. This approach leverages
Large Language Models to efficiently extract, store, and retrieve preferences
within these categories, ensuring both personalisation and transparency. We
also introduce a synthetic multi-turn, multi-session conversation dataset
(CarMem), grounded in real industry data, tailored to an in-car voice assistant
setting. Benchmarked on the dataset, our system achieves an F1-score of .78 to
.95 in preference extraction, depending on category granularity. Our
maintenance strategy reduces redundant preferences by 95% and contradictory
ones by 92%, while the accuracy of optimal retrieval is .87. Collectively,
the results demonstrate the system's suitability for industrial applications.
comment: Accepted for presentation at the International Conference on
Computational Linguistics (COLING 2025)
☆ From Scarcity to Capability: Empowering Fake News Detection in Low-Resource Languages with LLMs
Hrithik Majumdar Shibu, Shrestha Datta, Md. Sumon Miah, Nasrullah Sami, Mahruba Sharmin Chowdhury, Md. Saiful Islam
The rapid spread of fake news presents a significant global challenge,
particularly in low-resource languages like Bangla, which lack adequate
datasets and detection tools. Although manual fact-checking is accurate, it is
too expensive and slow to prevent the dissemination of fake news. Addressing this
gap, we introduce BanFakeNews-2.0, a robust dataset to enhance Bangla fake news
detection. This version includes 11,700 additional, meticulously curated fake
news articles validated from credible sources, creating a proportional dataset
of 47,000 authentic and 13,000 fake news items across 13 categories. In
addition, we created a manually curated independent test set of 460 fake and
540 authentic news items for rigorous evaluation. We invested effort in
collecting fake news from credible sources and manually verifying it while
preserving its linguistic richness. We develop a benchmark system utilizing
transformer-based architectures, including fine-tuned Bidirectional Encoder
Representations from Transformers variants (F1: 87%) and Large Language Models
with Quantized Low-Rank Approximation (F1: 89%), that significantly outperforms
traditional methods. BanFakeNews-2.0 offers a valuable resource to advance
research and application in fake news detection for low-resource languages. We
publicly release our dataset and model on GitHub to foster research in this
direction.
☆ Stylomech: Unveiling Authorship via Computational Stylometry in English and Romanized Sinhala
With the advent of Web 2.0, developments in social technology, coupled with
global communication, have brought both positive and negative impacts to
society. Copyright claims and author identification have become crucial, as
content violations have increased considerably owing to a lack of proper
ethics in society. Author attribution in both English and Romanized Sinhala
has become a major requirement over the last few decades. As an area largely
unexplored, particularly within the context of Romanized Sinhala, this
research contributes significantly to the field of computational linguistics.
The proposed author attribution system offers a unique approach, requiring
only two sets of text for comparison: the suspect author's text and the
anonymous text, a departure from traditional methodologies that often rely on
larger corpora. This work trains the model on numerical representations of
pairs of texts by the same and by different authors, rather than on raw text,
which allows it to be applied to a multitude of authors and contexts, given
that the suspect author's text and the anonymous text are of reasonable
quality. By expanding the scope of
authorship attribution to encompass diverse linguistic contexts, the work
contributes to fostering trust and accountability in digital communication,
especially in Sri Lanka. This research presents a pioneering approach to author
attribution in both English and Romanized Sinhala, addressing a critical need
for content verification and intellectual property rights enforcement in the
digital age.
comment: 3 figures, 1 image
☆ Analyzing Continuous Semantic Shifts with Diachronic Word Similarity Matrices COLING2025
The meanings and relationships of words shift over time, a phenomenon
referred to as semantic shift. Research on how semantic shifts unfold over
multiple time periods is essential for a detailed understanding of the
phenomenon. However, detecting change points only between adjacent time
periods is insufficient for analyzing detailed semantic shifts, and using
BERT-based methods to examine word sense proportions incurs a high
computational cost. To address these issues, we propose a simple yet intuitive
framework for analyzing how semantic shifts occur over multiple time periods
by leveraging a similarity matrix between the embeddings of the same word
through time. We compute a diachronic word similarity matrix using fast and
lightweight word embeddings across arbitrary time periods, making it possible
to analyze continuous semantic shifts in greater depth. Additionally, by
clustering the similarity matrices for different words, we can categorize
words that exhibit similar patterns of semantic shift in an unsupervised
manner.
comment: COLING2025
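A minimal sketch of the core object, assuming one aligned embedding of the
target word per time period (the alignment method is not specified here):

import numpy as np

def diachronic_similarity_matrix(word_vectors):
    # word_vectors: (T, d) array, one embedding of the same word per period.
    normed = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    return normed @ normed.T  # (T, T) cosine similarities across period pairs

sim = diachronic_similarity_matrix(np.random.randn(5, 100))

Flattening each word's matrix and clustering the resulting vectors (e.g., with
k-means) would then group words with similar shift patterns, as the abstract
describes.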
☆ Confidence Estimation for Error Detection in Text-to-SQL Systems AAAI 2025
Text-to-SQL enables users to interact with databases through natural
language, simplifying the retrieval and synthesis of information. Despite the
success of large language models (LLMs) in converting natural language
questions into SQL queries, their broader adoption is limited by two main
challenges: achieving robust generalization across diverse queries and ensuring
interpretative confidence in their predictions. To tackle these issues, our
research investigates the integration of selective classifiers into Text-to-SQL
systems. We analyse the trade-off between coverage and risk using entropy-based
confidence estimation with selective classifiers and assess its impact on the
overall performance of Text-to-SQL models. Additionally, we explore the models'
initial calibration and improve it with calibration techniques for better
alignment between confidence and accuracy. Our experimental results show that
the encoder-decoder T5 is better calibrated than in-context-learning GPT-4 and
decoder-only Llama 3, so the designated external entropy-based selective
classifier performs better with it. The study also reveals that, in terms of
error detection, the selective classifier is more likely to detect errors
associated with irrelevant questions than incorrect query generations.
comment: 15 pages, 11 figures, to be published in AAAI 2025 Proceedings
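As a rough sketch of the entropy-based selective classification idea (the
threshold and input format are assumptions, not the paper's exact
formulation), a generated SQL query is returned only when the model's average
token entropy is low enough; otherwise the system abstains:

import math

def mean_token_entropy(step_distributions):
    # step_distributions: one probability distribution (dict) per generated token.
    entropies = [-sum(p * math.log(p) for p in dist.values() if p > 0)
                 for dist in step_distributions]
    return sum(entropies) / len(entropies)

def selective_predict(sql_query, step_distributions, threshold=1.0):
    uncertainty = mean_token_entropy(step_distributions)
    return sql_query if uncertainty < threshold else None  # None = abstain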
☆ Augmenting a Large Language Model with a Combination of Text and Visual Data for Conversational Visualization of Global Geospatial Data
We present a method for augmenting a Large Language Model (LLM) with a
combination of text and visual data to enable accurate question answering in
visualization of scientific data, making conversational visualization possible.
LLMs struggle with tasks like visual data interaction, as they lack contextual
visual information. We address this problem by merging a text description of a
visualization and dataset with snapshots of the visualization. We extract their
essential features into a structured text file that is highly compact yet
descriptive enough to augment the LLM with the appropriate contextual
information, without any fine-tuning. This approach can be applied to any
visualization that has already been fully rendered, as long as it is associated
with some textual description.
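A toy sketch of the augmentation step (field names and example content are
invented for illustration): compress the visualization and dataset
descriptions into a compact structured block and prepend it to the user's
question, so no fine-tuning is required.

def build_context(viz_description, data_summary, snapshot_caption):
    # Assemble a compact, structured text context for the LLM.
    return ("VISUALIZATION: " + viz_description + "\n"
            "DATA: " + data_summary + "\n"
            "SNAPSHOT: " + snapshot_caption + "\n")

prompt = build_context(
    "Global sea-surface temperature map, equirectangular projection",
    "Monthly means, 1982-2023, 0.25-degree grid",
    "Warm anomaly band along the equatorial Pacific",
) + "QUESTION: Where is the strongest anomaly?"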
☆ PIER: A Novel Metric for Evaluating What Matters in Code-Switching ICASSP 2025
Code-switching, the alternation of languages within a single discourse,
presents a significant challenge for Automatic Speech Recognition. Despite the
unique nature of the task, performance is commonly measured with established
metrics such as Word-Error-Rate (WER). However, in this paper, we question
whether these general metrics accurately assess performance on code-switching.
Specifically, using both Connectionist-Temporal-Classification and
Encoder-Decoder models, we show that fine-tuning on non-code-switched data from
both the matrix and embedded languages improves classical metrics on
code-switching test sets, although performance on the actual code-switched
words worsens (as expected). Therefore, we propose the Point-of-Interest Error
Rate (PIER), a variant of WER that focuses only on specific words of interest.
We instantiate PIER on code-switched utterances and show that it describes
code-switching performance more accurately, revealing substantial room for
improvement in future work. This focused evaluation allows for a more precise
assessment of model performance, particularly in challenging aspects such as
inter-word and intra-word code-switching.
comment: Accepted at ICASSP 2025
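A simplified reading of the metric, not the authors' exact definition: align
reference and hypothesis, then count errors only over a designated set of
words of interest (here, the code-switched ones).

import difflib

def poi_error_rate(reference, hypothesis, words_of_interest):
    ref, hyp = reference.split(), hypothesis.split()
    total = sum(w in words_of_interest for w in ref)
    errors = 0
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op != "equal":  # substitutions/deletions touching a word of interest
            errors += sum(w in words_of_interest for w in ref[i1:i2])
    return errors / max(total, 1)

# The error on "meeting" counts; errors on matrix-language words would not.
print(poi_error_rate("ich gehe zum meeting heute", "ich gehe zum meting heute", {"meeting"}))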
☆ Exploring the Inquiry-Diagnosis Relationship with Advanced Patient Simulators
Zhaocheng Liu, Quan Tu, Wen Ye, Yu Xiao, Zhishou Zhang, Hengfu Cui, Yalun Zhu, Qiang Ju, Shizheng Li, Jian Xie
Online medical consultation (OMC) restricts doctors to gathering patient
information solely through inquiries, making the already complex sequential
decision-making process of diagnosis even more challenging. Recently, the rapid
advancement of large language models has demonstrated a significant potential
to transform OMC. However, most studies have primarily focused on improving
diagnostic accuracy under conditions of relatively sufficient information,
while paying limited attention to the "inquiry" phase of the consultation
process. This lack of focus has left the relationship between "inquiry" and
"diagnosis" insufficiently explored. In this paper, we first extract real
patient interaction strategies from authentic doctor-patient conversations and
use these strategies to guide the training of a patient simulator that closely
mirrors real-world behavior. By inputting medical records into our patient
simulator to simulate patient responses, we conduct extensive experiments to
explore the relationship between "inquiry" and "diagnosis" in the consultation
process. Experimental results demonstrate that inquiry and diagnosis adhere to
Liebig's law: poor inquiry quality limits the effectiveness of diagnosis,
regardless of diagnostic capability, and vice versa. Furthermore, the
experiments reveal significant differences in the inquiry performance of
various models. To investigate this phenomenon, we categorize the inquiry
process into four types: (1) chief complaint inquiry; (2) specification of
known symptoms; (3) inquiry about accompanying symptoms; and (4) gathering
family or medical history. We analyze the distribution of inquiries across the
four types for different models to explore the reasons behind their significant
performance differences. We plan to open-source the weights and related code of
our patient simulator at https://github.com/LIO-H-ZEN/PatientSimulator.
☆ Scaling Graph-Based Dependency Parsing with Arc Vectorization and Attention-Based Refinement
We propose a novel architecture for graph-based dependency parsing that
explicitly constructs vectors, from which both arcs and labels are scored. Our
method addresses key limitations of the standard two-pipeline approach by
unifying arc scoring and labeling into a single network, reducing scalability
issues caused by the information bottleneck and lack of parameter sharing.
Additionally, our architecture overcomes limited arc interactions with
transformer layers to efficiently simulate higher-order dependencies.
Experiments on PTB and UD show that our model outperforms state-of-the-art
parsers in both accuracy and efficiency.
☆ Solving the unsolvable: Translating case law in Hong Kong
This paper addresses the challenges of translating case law under Hong Kong's
bilingual legal system. It highlights the initial success of translating all
written statutes into Chinese before the 1997 handover, a task mandated by the
Basic Law. The effort involved significant collaboration among legal,
linguistic, and translation experts, resulting in a comprehensive and
culturally appropriate bilingual legal system. However, translating case law
remains a significant challenge due to the sheer volume and continuous growth
of judicial decisions. The paper critiques the government's and judiciary's
sporadic and uncoordinated efforts to translate case law, contrasting them with
the thorough approach previously taken for statute translation. Although the
government acknowledges the importance of legal bilingualism, it lacks a
sustainable strategy for translating case law. The Judiciary's position that
translating all judgments is unnecessary, unrealistic, and not cost-effective
is analyzed and critiqued for its impact on legal transparency and public
trust. A
proposed solution involves leveraging machine translation technology through a
human-machine interactive translation platform, which undergoes two major
transitions. Initially based on a neural model, the platform transitions to
using a large language model for improved translation accuracy. Furthermore, it
evolves from a single-agent system to a multi-agent system, incorporating
Translator, Annotator, and Proofreader agents. This multi-agent approach,
supported by a grant, aims to facilitate efficient, high-quality translation of
judicial judgments by integrating advanced artificial intelligence and
continuous feedback mechanisms, thus better meeting the needs of a bilingual
legal system.
☆ A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy
Huandong Wang, Wenjie Fu, Yingzhou Tang, Zhilong Chen, Yuxi Huang, Jinghua Piao, Chen Gao, Fengli Xu, Tao Jiang, Yong Li
While large language models (LLMs) present significant potential for
supporting numerous real-world applications and delivering positive social
impacts, they still face significant challenges in terms of the inherent risk
of privacy leakage, hallucinated outputs, and value misalignment, and can be
maliciously used to generate toxic content for unethical purposes after being
jailbroken. Therefore, in this survey, we present a comprehensive review of
recent advancements aimed at mitigating these issues, organized across the four
phases of LLM development and usage: data collection and pre-training,
fine-tuning and alignment, prompting and reasoning, and post-processing and
auditing. We elaborate on the recent advances for enhancing the performance of
LLMs in terms of privacy protection, hallucination reduction, value alignment,
toxicity elimination, and jailbreak defenses. In contrast to previous surveys
that focus on a single dimension of responsible LLMs, this survey presents a
unified framework that encompasses these diverse dimensions, providing a
comprehensive view of enhancing LLMs to better serve real-world applications.
☆ AutoCBT: An Autonomous Multi-agent Framework for Cognitive Behavioral Therapy in Psychological Counseling
Ancheng Xu, Di Yang, Renhao Li, Jingwei Zhu, Minghuan Tan, Min Yang, Wanxin Qiu, Mingchen Ma, Haihong Wu, Bingyu Li, Feng Sha, Chengming Li, Xiping Hu, Qiang Qu, Derek F. Wong, Ruifeng Xu
Traditional in-person psychological counseling remains primarily niche, often
chosen by individuals with psychological issues, while online automated
counseling offers a potential solution for those hesitant to seek help due to
feelings of shame. Cognitive Behavioral Therapy (CBT) is an essential and
widely used approach in psychological counseling. The advent of large language
models (LLMs) and agent technology enables automatic CBT diagnosis and
treatment. However, current LLM-based CBT systems use agents with a fixed
structure, which limits their self-optimization capability, or provide hollow,
unhelpful suggestions due to redundant response patterns. In this work, we
utilize Quora-like and YiXinLi single-round consultation models to build a
general agent framework that generates high-quality responses for single-turn
psychological consultation scenarios. We use a bilingual dataset to evaluate
the quality of single-response consultations generated by each framework. Then,
we incorporate dynamic routing and supervisory mechanisms inspired by real
psychological counseling to construct a CBT-oriented autonomous multi-agent
framework, demonstrating its general applicability. Experimental results
indicate that AutoCBT can provide higher-quality automated psychological
counseling services.
☆ Vision-Language Models Do Not Understand Negation
Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip Torr, Yoon Kim, Marzyeh Ghassemi
Many practical vision-language applications require models that understand
negation, e.g., when using natural language to retrieve images which contain
certain objects but not others. Despite advancements in vision-language models
(VLMs) through large-scale training, their ability to comprehend negation
remains underexplored. This study addresses the question: how well do current
VLMs understand negation? We introduce NegBench, a new benchmark designed to
evaluate negation understanding across 18 task variations and 79k examples
spanning image, video, and medical datasets. The benchmark consists of two core
tasks designed to evaluate negation understanding in diverse multimodal
settings: Retrieval with Negation and Multiple Choice Questions with Negated
Captions. Our evaluation reveals that modern VLMs struggle significantly with
negation, often performing at chance level. To address these shortcomings, we
explore a data-centric approach wherein we finetune CLIP models on large-scale
synthetic datasets containing millions of negated captions. We show that this
approach can result in a 10% increase in recall on negated queries and a 40%
boost in accuracy on multiple-choice questions with negated captions.
comment: Project page: https://negbench.github.io
☆ mGeNTE: A Multilingual Resource for Gender-Neutral Language and Translation
Gender-neutral language reflects societal and linguistic shifts towards
greater inclusivity by avoiding the implication that one gender is the norm
over others. This is particularly relevant for grammatical gender languages,
which heavily encode the gender of terms for human referents and over-rely on
masculine forms, even when gender is unspecified or irrelevant. Language
technologies are known to mirror these inequalities, being affected by a male
bias and perpetuating stereotypical associations when translating into
languages with extensive gendered morphology. In such cases, gender-neutral
language can help avoid undue binary assumptions. However, despite its
importance for creating fairer multi- and cross-lingual technologies, inclusive
language research remains scarce and insufficiently supported in current
resources. To address this gap, we present the multilingual mGeNTE dataset.
Derived from the bilingual GeNTE (Piergentili et al., 2023), mGeNTE extends the
original corpus to include the English-Italian/German/Spanish language pairs.
Since each language pair is English-aligned with gendered and neutral sentences
in the target languages, mGeNTE enables research in both automatic
Gender-Neutral Translation (GNT) and language modelling for three grammatical
gender languages.
☆ Evaluating LLM Abilities to Understand Tabular Electronic Health Records: A Comprehensive Study of Patient Data Extraction and Retrieval ECIR
Electronic Health Record (EHR) tables pose unique challenges among which is
the presence of hidden contextual dependencies between medical features with a
high level of data dimensionality and sparsity. This study presents the first
investigation into the abilities of LLMs to comprehend EHRs for patient data
extraction and retrieval. We conduct extensive experiments using the MIMICSQL
dataset to explore the impact of prompt structure, instructions, context, and
demonstrations on the task performance of two backbone LLMs, Llama2 and
Meditron. Through quantitative and qualitative analyses, our findings show
that optimal feature selection and serialization methods can enhance task
performance by up to 26.79% compared to naive approaches. Similarly, in-context
learning setups with relevant example selection improve data extraction
performance by 5.95%. Based on our study findings, we propose guidelines that
we believe would help the design of LLM-based models to support health search.
comment: To be published as full paper in the Proceedings of the European
Conference on Information Retrieval (ECIR) 2025. Preprint
☆ ChartInsighter: An Approach for Mitigating Hallucination in Time-series Chart Summary Generation with A Benchmark Dataset
Effective chart summaries can significantly reduce the time and effort decision
makers spend interpreting charts, enabling precise and efficient communication
of data insights. Previous studies have faced challenges in generating accurate
and semantically rich summaries of time-series data charts. In this paper, we
identify summary elements and common hallucination types in the generation of
time-series chart summaries, which serve as our guidelines for automatic
generation. We introduce ChartInsighter, which automatically generates chart
summaries of time-series data, effectively reducing hallucinations in chart
summary generation. Specifically, we assign multiple agents to generate the
initial chart summary and collaborate iteratively, during which they invoke
external data analysis modules to extract insights and compile them into a
coherent summary. Additionally, we implement a self-consistency test method to
validate and correct our summary. We create a high-quality benchmark of charts
and summaries, with hallucination types annotated on a sentence-by-sentence
basis, facilitating the evaluation of the effectiveness of reducing
hallucinations. Evaluations on this benchmark show that our method surpasses
state-of-the-art models and achieves the lowest summary hallucination rate,
effectively reducing various hallucinations and improving summary quality. The
benchmark is available at
https://github.com/wangfen01/ChartInsighter.
☆ Algorithm for Semantic Network Generation from Texts of Low Resource Languages Such as Kiswahili
Processing low-resource languages, such as Kiswahili, with machine learning
is difficult due to the lack of adequate training data. However, such
low-resource languages remain important for human communication, are already
in daily use, and their users need practical machine processing tasks such as
summarization, disambiguation, and even question answering (QA). One method of
processing such languages, while bypassing the need for training data, is the
use of semantic networks. Some low-resource languages, such as Kiswahili,
follow the subject-verb-object (SVO) structure, and semantic networks are
likewise triples of subject-predicate-object; hence, SVO part-of-speech tags
can be mapped onto a semantic network triple. An algorithm that processes raw
natural language text and maps it into a semantic network is therefore
necessary and desirable for structuring low-resource language texts. This
algorithm, tested on the Kiswahili QA task, achieves up to 78.6% exact match.
comment: 18 pages, 3 figures, published in Open Journal for Information
Technology
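A toy sketch of the SVO-to-triple mapping described above, assuming a POS
tagger is available (the tags and the example sentence are illustrative):

def svo_to_triples(tagged_sentence):
    # tagged_sentence: list of (word, tag) pairs with tags like NOUN/VERB.
    subject = predicate = None
    triples = []
    for word, tag in tagged_sentence:
        if tag == "NOUN" and subject is None:
            subject = word
        elif tag == "VERB" and subject is not None:
            predicate = word
        elif tag == "NOUN" and predicate is not None:
            triples.append((subject, predicate, word))
            subject = predicate = None
    return triples

# "Juma anapenda chai" (Juma likes tea) -> [("Juma", "anapenda", "chai")]
print(svo_to_triples([("Juma", "NOUN"), ("anapenda", "VERB"), ("chai", "NOUN")]))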
☆ Shape-Based Single Object Classification Using Ensemble Method Classifiers
Nowadays, more and more images are available. Annotation and retrieval of the
images pose classification problems, where each class is defined as the group
of database images labelled with a common semantic label. Various systems have
been proposed for content-based retrieval, as well as for image classification
and indexing. In this paper, a hierarchical classification framework has been
proposed for bridging the semantic gap effectively and achieving multi-category
image classification. A well-known pre-processing and post-processing method
was applied to three problems: image segmentation, object identification, and
image classification. The method was used to classify single-object images
from the Amazon and Google datasets. The classification was tested with four
different classifiers: BayesNetwork (BN), Random Forest (RF), Bagging, and
Vote. The estimated classification accuracies ranged from 20% to 99% (using
10-fold cross-validation). The Bagging classifier showed the best performance,
followed by the Random Forest classifier.
☆ A Study of In-Context-Learning-Based Text-to-SQL Errors
Jiawei Shen, Chengcheng Wan, Ruoyi Qiao, Jiazhen Zou, Hang Xu, Yuchen Shao, Yueling Zhang, Weikai Miao, Geguang Pu
Large language models (LLMs) have been adopted to perform text-to-SQL tasks,
utilizing their in-context learning (ICL) capability to translate natural
language questions into structured query language (SQL). However, such a
technique faces correctness problems and requires efficient repairing
solutions. In this paper, we conduct the first comprehensive study of
text-to-SQL errors. Our study covers four representative ICL-based techniques,
five basic repairing methods, two benchmarks, and two LLM settings. We find
that text-to-SQL errors are widespread and summarize 29 error types of 7
categories. We also find that existing repairing attempts have limited
correctness improvement at the cost of high computational overhead with many
mis-repairs. Based on the findings, we propose MapleRepair, a novel text-to-SQL
error detection and repairing framework. The evaluation demonstrates that
MapleRepair outperforms existing solutions by repairing 13.8% more queries with
negligible mis-repairs and 67.4% less overhead.
☆ Understanding Mental Health Content on Social Media and Its Effect Towards Suicidal Ideation
This review underscores the critical need for effective strategies to
identify and support individuals with suicidal ideation, exploiting
technological innovations in ML and DL to further suicide prevention efforts.
The study details the application of these technologies in analyzing vast
amounts of unstructured social media data to detect linguistic patterns,
keywords, phrases, tones, and contextual cues associated with suicidal
thoughts. It explores various ML and DL models like SVMs, CNNs, LSTM, neural
networks, and their effectiveness in interpreting complex data patterns and
emotional nuances within text data. The review discusses the potential of these
technologies to serve as a life-saving tool by identifying at-risk individuals
through their digital traces. Furthermore, it evaluates the real-world
effectiveness, limitations, and ethical considerations of employing these
technologies for suicide prevention, stressing the importance of responsible
development and usage. The study aims to fill critical knowledge gaps by
analyzing recent studies, methodologies, tools, and techniques in this field.
It highlights the importance of synthesizing current literature to inform
practical tools and suicide prevention efforts, guiding innovation in reliable,
ethical systems for early intervention. This research synthesis evaluates the
intersection of technology and mental health, advocating for the ethical and
responsible application of ML, DL, and NLP to offer life-saving potential
worldwide while addressing challenges like generalizability, biases, privacy,
and the need for further research to ensure these technologies do not
exacerbate existing inequities and harms.
☆ Efficient Few-Shot Medical Image Analysis via Hierarchical Contrastive Vision-Language Learning
Few-shot learning in medical image classification presents a significant
challenge due to the limited availability of annotated data and the complex
nature of medical imagery. In this work, we propose Adaptive Vision-Language
Fine-tuning with Hierarchical Contrastive Alignment (HiCA), a novel framework
that leverages the capabilities of Large Vision-Language Models (LVLMs) for
medical image analysis. HiCA introduces a two-stage fine-tuning strategy,
combining domain-specific pretraining and hierarchical contrastive learning to
align visual and textual representations at multiple levels. We evaluate our
approach on two benchmark datasets, Chest X-ray and Breast Ultrasound,
achieving state-of-the-art performance in both few-shot and zero-shot settings.
Further analyses demonstrate the robustness, generalizability, and
interpretability of our method, with substantial improvements in performance
compared to existing baselines. Our work highlights the potential of
hierarchical contrastive strategies in adapting LVLMs to the unique challenges
of medical imaging tasks.
☆ To Retrieve or Not to Retrieve? Uncertainty Detection for Dynamic Retrieval Augmented Generation
Retrieval-Augmented Generation equips large language models with the
capability to retrieve external knowledge, thereby mitigating hallucinations by
incorporating information beyond the model's intrinsic abilities. However, most
prior works have focused on invoking retrieval deterministically, which makes
it unsuitable for tasks such as long-form question answering. Instead,
dynamically performing retrieval by invoking it only when the underlying LLM
lacks the required knowledge can be more efficient. In this context, we delve
deeper into the question, "To Retrieve or Not to Retrieve?" by exploring
multiple uncertainty detection methods. We evaluate these methods for the task
of long-form question answering, employing dynamic retrieval, and present our
comparisons. Our findings suggest that uncertainty detection metrics, such as
Degree Matrix Jaccard and Eccentricity, can reduce the number of retrieval
calls by almost half, with only a slight reduction in question-answering
accuracy.
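To make the idea concrete, here is a hedged sketch of one agreement-based
uncertainty signal in the spirit of the metrics named above (the paper's exact
definitions differ): sample several answers, build a Jaccard similarity graph
over them, and trigger retrieval when agreement is low.

import numpy as np

def should_retrieve(sampled_answers, threshold=0.5):
    # Assumes at least two sampled answers from the LLM for the same question.
    sets = [set(a.lower().split()) for a in sampled_answers]
    n = len(sets)
    sim = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = (len(sets[i] & sets[j])
                                     / max(len(sets[i] | sets[j]), 1))
    mean_agreement = (sim.sum() - n) / (n * (n - 1))  # mean off-diagonal similarity
    return mean_agreement < threshold  # disagreement -> uncertain -> retrieve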
☆ Perspective Transition of Large Language Models for Solving Subjective Tasks
Large language models (LLMs) have revolutionized the field of natural
language processing, enabling remarkable progress in various tasks. Different
from objective tasks such as commonsense reasoning and arithmetic
question-answering, the performance of LLMs on subjective tasks is still
limited, where the perspective taken on the specific problem plays a crucial
role in interpreting the context and giving a proper response. For example, in
certain scenarios, LLMs may perform better when answering from an expert role
perspective, potentially eliciting their relevant domain knowledge. In
contrast, in some scenarios, LLMs may provide more accurate responses when
answering from a third-person standpoint, enabling a more comprehensive
understanding of the problem and potentially mitigating inherent biases. In
this paper, we propose Reasoning through Perspective Transition (RPT), a method
based on in-context learning that enables LLMs to dynamically select among
direct, role, and third-person perspectives to find the best way to solve the
corresponding subjective problem. Through extensive experiments on a total of
12 subjective tasks using both closed-source and open-source LLMs, including
GPT-4, GPT-3.5, Llama-3, and Qwen-2, our method outperforms widely used
single-fixed-perspective methods such as chain-of-thought prompting and expert
prompting, and it highlights the intricate ways in which LLMs can adapt their
perspectives to provide nuanced and contextually appropriate responses to
different problems.
☆ Delayed Fusion: Integrating Large Language Models into First-Pass Decoding in End-to-end Speech Recognition ICASSP2025
This paper presents an efficient decoding approach for end-to-end automatic
speech recognition (E2E-ASR) with large language models (LLMs). Although
shallow fusion is the most common approach to incorporate language models into
E2E-ASR decoding, we face two practical problems with LLMs. (1) LLM inference
is computationally costly. (2) There may be a vocabulary mismatch between the
ASR model and the LLM. To resolve this mismatch, we need to retrain the ASR
model and/or the LLM, which is at best time-consuming and in many cases not
feasible. We propose "delayed fusion," which applies LLM scores to ASR
hypotheses with a delay during decoding and enables easier use of pre-trained
LLMs in ASR tasks. This method can reduce not only the number of hypotheses
scored by the LLM but also the number of LLM inference calls. It also allows
re-tokenization of ASR hypotheses during decoding if ASR and LLM employ different
tokenizations. We demonstrate that delayed fusion provides improved decoding
speed and accuracy compared to shallow fusion and N-best rescoring using the
LibriHeavy ASR corpus and three public LLMs, OpenLLaMA 3B & 7B and Mistral 7B.
comment: Accepted to ICASSP2025
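A highly simplified sketch of the scheduling idea, not the paper's
implementation: the beam is expanded with ASR scores at every step, while LLM
scores are folded in only every few steps, reducing LLM calls. `asr_step` and
`llm_logprob` are hypothetical stand-ins; `llm_logprob` is assumed to score
only the tokens appended since the previous fusion point.

def decode_with_delayed_fusion(asr_step, llm_logprob, beam, n_steps, delay=4, lam=0.3):
    # beam: list of (hypothesis, score) pairs maintained by the ASR decoder.
    for t in range(1, n_steps + 1):
        beam = asr_step(beam)  # expand and prune using ASR scores only
        if t % delay == 0:     # delayed fusion: add LLM scores every `delay` steps
            beam = [(hyp, score + lam * llm_logprob(hyp)) for hyp, score in beam]
            beam.sort(key=lambda pair: pair[1], reverse=True)
    return beam[0]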
☆ Foundations of Large Language Models
This is a book about large language models. As indicated by the title, it
primarily focuses on foundational concepts rather than comprehensive coverage
of all cutting-edge technologies. The book is structured into four main
chapters, each exploring a key area: pre-training, generative models, prompting
techniques, and alignment methods. It is intended for college students,
professionals, and practitioners in natural language processing and related
fields, and can serve as a reference for anyone interested in large language
models.
☆ A Simple Graph Contrastive Learning Framework for Short Text Classification AAAI2025
Short text classification has gained significant attention in the information
age due to its prevalence and real-world applications. Recent advancements in
graph learning combined with contrastive learning have shown promising results
in addressing the challenges of semantic sparsity and limited labeled data in
short text classification. However, existing models have certain limitations.
They rely on explicit data augmentation techniques to generate contrastive
views, resulting in semantic corruption and noise. Additionally, these models
only focus on learning the intrinsic consistency between the generated views,
neglecting valuable discriminative information from other potential views. To
address these issues, we propose a Simple graph contrastive learning framework
for Short Text Classification (SimSTC). Our approach involves performing graph
learning on multiple text-related component graphs to obtain multi-view text
embeddings. Subsequently, we directly apply contrastive learning on these
embeddings. Notably, our method eliminates the need for data augmentation
operations to generate contrastive views while still leveraging the benefits of
multi-view contrastive learning. Despite its simplicity, our model achieves
outstanding performance, surpassing large language models on various datasets.
comment: AAAI2025
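A minimal PyTorch sketch of the augmentation-free multi-view contrastive step:
two embedding views of the same batch of texts (e.g., obtained from two
different text-component graphs) are aligned with an InfoNCE loss whose
positives sit on the diagonal. The full SimSTC model has more components; this
only illustrates the loss.

import torch
import torch.nn.functional as F

def infonce(view_a, view_b, temperature=0.1):
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / temperature    # (N, N) cross-view similarity matrix
    targets = torch.arange(a.size(0))   # i-th text in view A matches i-th in view B
    return F.cross_entropy(logits, targets)

loss = infonce(torch.randn(8, 64), torch.randn(8, 64))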
☆ Boosting Short Text Classification with Multi-Source Information Exploration and Dual-Level Contrastive Learning AAAI2025
Short text classification, as a research subtopic in natural language
processing, is more challenging due to its semantic sparsity and insufficient
labeled samples in practical scenarios. We propose a novel model named
MI-DELIGHT for short text classification in this work. Specifically, it first
performs multi-source information (i.e., statistical information, linguistic
information, and factual information) exploration to alleviate the sparsity
issues. Then, the graph learning approach is adopted to learn the
representation of short texts, which are presented in graph forms. Moreover, we
introduce a dual-level (i.e., instance-level and cluster-level) contrastive
learning auxiliary task to effectively capture different-grained contrastive
information within massive unlabeled data. Meanwhile, previous models merely
perform the main task and auxiliary tasks in parallel, without considering the
relationship among tasks. Therefore, we introduce a hierarchical architecture
to explicitly model the correlations between tasks. We conduct extensive
experiments across various benchmark datasets, demonstrating that MI-DELIGHT
significantly surpasses previous competitive models. It even outperforms
popular large language models on several datasets.
comment: AAAI2025
☆ FineMedLM-o1: Enhancing the Medical Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training
Recent advancements in large language models (LLMs) have shown promise in
medical applications such as disease diagnosis and treatment planning. However,
most existing medical LLMs struggle with the advanced reasoning required for
complex clinical scenarios, such as differential diagnosis or personalized
treatment suggestions. We proposed FineMedLM-o1, which leverages high-quality
synthetic medical data and long-form reasoning data for Supervised Fine-Tuning
(SFT) and Direct Preference Optimization (DPO), enabling advanced dialogue and
deep reasoning capabilities. Additionally, we introduced Test-Time Training
(TTT) in the medical domain for the first time, facilitating domain adaptation
and ensuring reliable, accurate reasoning. Experimental results demonstrate
that FineMedLM-o1 achieves a 23% average performance improvement over prior
models on key medical benchmarks. Furthermore, the introduction of TTT provides
an additional 14% performance boost, highlighting its effectiveness in
enhancing medical reasoning capabilities. To support this process, we also
proposed a novel method for synthesizing medical dialogue. Compared to other
open-source datasets, our dataset stands out as superior in both quality and
complexity. The project and data will be released on GitHub.
♻ ☆ NL2KQL: From Natural Language to Kusto Query
Xinye Tang, Amir H. Abdi, Jeremias Eichelbaum, Mahan Das, Alex Klein, Nihal Irmak Pakis, William Blum, Daniel L Mace, Tanvi Raja, Namrata Padmanabhan, Ye Xing
Data is growing rapidly in volume and complexity. Proficiency in database
query languages is pivotal for crafting effective queries. As coding assistants
become more prevalent, there is significant opportunity to enhance database
query languages. The Kusto Query Language (KQL) is a widely used query language
for large semi-structured data such as logs, telemetries, and time-series for
big data analytics platforms. This paper introduces NL2KQL, an innovative
framework that uses large language models (LLMs) to convert natural language
queries (NLQs) into KQL queries. The proposed NL2KQL framework includes several
key components: Schema Refiner which narrows down the schema to its most
pertinent elements; the Few-shot Selector which dynamically selects relevant
examples from a few-shot dataset; and the Query Refiner which repairs syntactic
and semantic errors in KQL queries. Additionally, this study outlines a method
for generating large datasets of synthetic NLQ-KQL pairs that are valid within
a specific database context. To validate NL2KQL's performance, we utilize an
array of online (based on query execution) and offline (based on query parsing)
metrics. Through ablation studies, the significance of each framework component
is examined, and the datasets used for benchmarking are made publicly
available. This work is the first of its kind and is compared with available
baselines to demonstrate its effectiveness.
♻ ☆ Aligning Brain Activity with Advanced Transformer Models: Exploring the Role of Punctuation in Semantic Processing
This research examines the congruence between neural activity and advanced
transformer models, emphasizing the semantic significance of punctuation in
text understanding. Utilizing an innovative approach originally proposed by
Toneva and Wehbe, we evaluate four advanced transformer models (RoBERTa,
DistilBERT, ALBERT, and ELECTRA) against neural activity data. Our findings
indicate that RoBERTa exhibits the closest alignment with neural activity,
surpassing BERT in accuracy. Furthermore, we investigate the impact of
punctuation removal on model performance and neural alignment, revealing that
BERT's accuracy improves in the absence of punctuation. This study contributes
to the comprehension of how neural networks represent language and the
influence of punctuation on semantic processing within the human brain.
♻ ☆ ReFactor GNNs: Revisiting Factorisation-based Models from a Message-Passing Perspective NeurIPS 2022
Yihong Chen, Pushkar Mishra, Luca Franceschi, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel
Factorisation-based Models (FMs), such as DistMult, have enjoyed enduring
success for Knowledge Graph Completion (KGC) tasks, often outperforming Graph
Neural Networks (GNNs). However, unlike GNNs, FMs struggle to incorporate node
features and generalise to unseen nodes in inductive settings. Our work bridges
the gap between FMs and GNNs by proposing ReFactor GNNs. This new architecture
draws upon both modelling paradigms, which previously were largely thought of
as disjoint. Concretely, using a message-passing formalism, we show how FMs can
be cast as GNNs by reformulating the gradient descent procedure as
message-passing operations, which forms the basis of our ReFactor GNNs. Across
a multitude of well-established KGC benchmarks, our ReFactor GNNs achieve
comparable transductive performance to FMs, and state-of-the-art inductive
performance while using an order of magnitude fewer parameters.
comment: 36th Conference on Neural Information Processing Systems (NeurIPS
2022)
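The following toy numpy sketch illustrates the core observation: one SGD step on a DistMult scoring loss can be read as nodes exchanging messages along knowledge-graph edges. The loss, dimensions, and learning rate are illustrative, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ent, n_rel, lr = 8, 4, 2, 0.1
E = rng.normal(scale=0.1, size=(n_ent, d))   # entity embeddings (node states)
R = rng.normal(scale=0.1, size=(n_rel, d))   # relation embeddings (edge types)

triples = [(0, 0, 1), (2, 1, 3)]             # (head, relation, tail) edges
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

for h, r, t in triples:
    score = np.sum(E[h] * R[r] * E[t])       # DistMult score <e_h, w_r, e_t>
    g = sigmoid(score) - 1.0                 # d(loss)/d(score) for a positive triple
    # Compute both messages before updating, as in synchronous message passing.
    msg_to_h = R[r] * E[t]                   # neighbour state modulated by edge type
    msg_to_t = R[r] * E[h]
    # The SGD update *is* the node receiving and applying its message.
    E[h] -= lr * g * msg_to_h
    E[t] -= lr * g * msg_to_t
```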
♻ ☆ PolInterviews -- A Dataset of German Politician Public Broadcast Interviews
This paper presents a novel dataset of public broadcast interviews featuring
high-ranking German politicians. The interviews were sourced from YouTube,
transcribed, processed for speaker identification, and stored in a tidy and
open format. The dataset comprises 99 interviews with 33 different German
politicians across five major interview formats, containing a total of 28,146
sentences. As the first of its kind, this dataset offers valuable opportunities
for research on various aspects of political communication in the (German)
political context, such as agenda-setting, interviewer dynamics, or
politicians' self-presentation.
♻ ☆ PolInterviews -- A Dataset of German Politician Public Broadcast
Interviews
♻ ☆ Crafting Customisable Characters with LLMs: Introducing SimsChat, a Persona-Driven Role-Playing Agent Framework
Bohao Yang, Dong Liu, Chenghao Xiao, Kun Zhao, Chen Tang, Chao Li, Lin Yuan, Guang Yang, Lanxiao Huang, Chenghua Lin
Large Language Models (LLMs) demonstrate a remarkable ability to comprehend
instructions and generate human-like text, enabling sophisticated agent
simulation beyond basic behavior replication. However, the potential for
creating freely customisable characters remains underexplored. We introduce the
Customisable Conversation Agent Framework, which employs LLMs to simulate
real-world characters through personalised characteristic feature injection,
enabling diverse character creation according to user preferences. We propose
the SimsConv dataset, comprising 68 customised characters and 13,971 multi-turn
role-playing dialogues across 1,360 real-world scenes. Characters are initially
customised using pre-defined elements (career, aspiration, traits, skills),
then expanded through personal and social profiles. Building on this, we
present SimsChat, a freely customisable role-playing agent incorporating
various realistic settings and topic-specified character interactions.
Experimental results on both SimsConv and WikiRoleEval datasets demonstrate
SimsChat's superior performance in maintaining character consistency, knowledge
accuracy, and appropriate question rejection compared to existing models. Our
framework provides valuable insights for developing more accurate and
customisable human simulacra. Our data and code are publicly available at
https://github.com/Bernard-Yang/SimsChat.
♻ ☆ Crafting Customisable Characters with LLMs: Introducing SimsChat, a
Persona-Driven Role-Playing Agent Framework
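As an illustration of persona-driven characteristic injection, here is a minimal sketch that assembles a system prompt from the pre-defined elements named in the abstract (career, aspiration, traits, skills). The template wording and `build_system_prompt` helper are our assumptions, not the framework's actual code.

```python
def build_system_prompt(character: dict) -> str:
    # Inject pre-defined elements into a role-playing system prompt.
    return (
        f"You are {character['name']}, a {character['career']}.\n"
        f"Aspiration: {character['aspiration']}\n"
        f"Traits: {', '.join(character['traits'])}\n"
        f"Skills: {', '.join(character['skills'])}\n"
        "Stay in character, and politely reject questions your persona "
        "could not plausibly answer."
    )

character = {
    "name": "Mira",
    "career": "marine biologist",
    "aspiration": "map an unexplored reef",
    "traits": ["curious", "patient"],
    "skills": ["scuba diving", "data analysis"],
}
print(build_system_prompt(character))
```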
♻ ☆ Can linguists better understand DNA?
Multilingual transfer ability, which reflects how well models fine-tuned on
one source language can be applied to other languages, has been well studied
in multilingual pre-trained models. However, the existence of such capability
transfer between natural language and gene sequences/languages remains
underexplored. This study addresses this gap by drawing inspiration from the
sentence-pair classification task used for evaluating sentence similarity in
natural language. We constructed two analogous tasks: DNA-pair classification
(DNA sequence similarity) and DNA-protein-pair classification (gene coding
determination). These tasks were designed to validate the transferability of
capabilities from natural language to gene sequences. Even a small-scale
pre-trained model like GPT-2-small, which was pre-trained on English, achieved
an accuracy of 78% on the DNA-pair classification task after being fine-tuned
on English sentence-pair classification data (XTREME PAWS-X). When a BERT
model was trained on multilingual text, the precision reached 89%. On the more
complex DNA-protein-pair classification task, however, the model's output was
barely distinguishable from random output. Experimental validation has
confirmed that the transfer of capabilities from natural language to
biological language is unequivocally present. Building on this foundation, we
have also investigated the impact of model parameter scale and pre-training on
this capability transfer. We provide recommendations for facilitating the
transfer of capabilities from natural language to genetic language, as well as
new approaches for conducting biological research based on this capability.
This study offers an intriguing new perspective on exploring the relationship
between natural language and genetic language.
♻ ☆ Can linguists better understand DNA?
comment: 20 pages, 7 figures
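The transfer probe described above can be pictured with a short, hedged sketch: a GPT-2 pair classifier, assumed to have already been fine-tuned on XTREME PAWS-X sentence pairs, is applied unchanged to a DNA pair. The checkpoint name and toy sequences are placeholders.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "gpt2-finetuned-pawsx"  # hypothetical checkpoint fine-tuned on PAWS-X
tok = AutoTokenizer.from_pretrained(ckpt)
tok.pad_token = tok.eos_token  # GPT-2 defines no pad token by default
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)
model.config.pad_token_id = tok.pad_token_id

# A toy DNA pair, encoded exactly like a natural-language sentence pair.
inputs = tok("ACGTACGTGGCA", "ACGTACGTGGCT", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print("similar" if logits.argmax(-1).item() == 1 else "dissimilar")
```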
♻ ☆ aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing ICSE 2025
Siyuan Jiang, Jia Li, He Zong, Huanyu Liu, Hao Zhu, Shukai Hu, Erlu Li, Jiazheng Ding, Yu Han, Wei Ning, Gen Wang, Yihong Dong, Kechi Zhang, Ge Li
Large Language Models (LLMs) have been widely used in code completion, and
researchers are focusing on scaling up LLMs to improve their accuracy. However,
larger LLMs have lower inference efficiency, affecting developers' experience
and productivity. In this paper, we propose a lightweight and effective LLM for
code completion named aiXcoder-7B. Compared to existing LLMs, aiXcoder-7B
achieves higher code completion accuracy while having smaller scales (i.e., 7
billion parameters). We attribute the superiority of aiXcoder-7B to three key
factors: (1) Multi-objective training. We employ three training objectives, one
of which is our proposed Structured Fill-In-the-Middle (SFIM). SFIM considers
the syntax structures in code and effectively improves the performance of LLMs
for code. (2) Diverse data sampling strategies. They consider inter-file
relationships and enhance the capability of LLMs in understanding cross-file
contexts. (3) Extensive high-quality data. We establish a rigorous data
collection pipeline and consume a total of 1.2 trillion unique tokens for
training aiXcoder-7B. This vast volume of data enables aiXcoder-7B to learn a
broad distribution of code. We evaluate aiXcoder-7B on five popular code
completion benchmarks and a new benchmark collected for this paper. The results
show that aiXcoder-7B outperforms six recent LLMs of similar size and
even surpasses four larger LLMs (e.g., StarCoder2-15B and CodeLlama-34B),
positioning aiXcoder-7B as a lightweight and effective LLM for academia and
industry. Finally, we summarize three valuable insights for helping
practitioners train the next generation of LLMs for code. aiXcoder-7B has been
open-sourced and has gained significant attention. As of January 2025,
aiXcoder-7B has received 2,226 GitHub Stars.
♻ ☆ aiXcoder-7B: A Lightweight and Effective Large Language Model for Code
Processing ICSE 2025
comment: (1) Accepted by the 47th International Conference on Software
Engineering (ICSE 2025). (2) aiXcoder-7B is available at
https://github.com/aixcoder-plugin/aiXcoder-7B
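To illustrate the idea behind Structured Fill-In-the-Middle, the sketch below masks a span aligned with the syntax tree (one whole statement) rather than a random character span. The `<fim_*>` sentinels are generic FIM-style placeholders, not aiXcoder's actual special tokens.

```python
import ast

source = "def area(r):\n    pi = 3.14159\n    return pi * r * r\n"
tree = ast.parse(source)
stmt = tree.body[0].body[1]  # choose a syntax-complete unit: the return statement
lines = source.splitlines(keepends=True)
start = sum(len(l) for l in lines[: stmt.lineno - 1]) + stmt.col_offset
end = sum(len(l) for l in lines[: stmt.end_lineno - 1]) + stmt.end_col_offset

# Mask the statement and rearrange into a FIM-style training sample.
prefix, middle, suffix = source[:start], source[start:end], source[end:]
sample = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
print(sample)
```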
♻ ☆ AudioBERT: Audio Knowledge Augmented Language Model ICASSP 2025
Recent studies have identified that language models, pretrained on text-only
datasets, often lack elementary visual knowledge, e.g., colors of everyday
objects. Motivated by this observation, we ask whether a similar shortcoming
exists in terms of auditory knowledge. To answer this
question, we construct a new dataset called AuditoryBench, which consists of
two novel tasks for evaluating auditory knowledge. Based on our analysis using
the benchmark, we find that language models also suffer from a severe lack of
auditory knowledge. To address this limitation, we propose AudioBERT, a novel
method to augment the auditory knowledge of BERT through a retrieval-based
approach. First, we detect auditory knowledge spans in prompts to query our
retrieval model efficiently. Then, we inject audio knowledge into BERT and
switch on low-rank adaptation for effective adaptation when audio knowledge is
required. Our experiments demonstrate that AudioBERT is quite effective,
achieving superior performance on the AuditoryBench. The dataset and code are
available at https://github.com/HJ-Ok/AudioBERT.
♻ ☆ AudioBERT: Audio Knowledge Augmented Language Model ICASSP 2025
comment: 5 pages, 3 figures, ICASSP 2025
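The retrieval-gated design can be summarized in a short sketch: the adapter path is enabled only when an auditory-knowledge span is detected and audio knowledge is retrieved. Every helper here (the keyword detector, `retriever`, and the LoRA-toggling `bert` object) is a hypothetical stand-in for the paper's trained components.

```python
AUDITORY_CUES = ("sound", "noise", "pitch", "bark", "siren")

def detect_auditory_span(prompt):
    # Stand-in detector: keyword matching instead of the paper's trained model.
    for word in prompt.lower().split():
        if word.strip(".,!?") in AUDITORY_CUES:
            return word
    return None

def answer(prompt, bert, retriever):
    span = detect_auditory_span(prompt)
    if span is None:
        bert.disable_lora()            # plain, text-only BERT path
        return bert.predict(prompt)
    audio_emb = retriever.query(span)  # retrieve audio knowledge for the span
    bert.enable_lora()                 # adapt only when audio knowledge is injected
    return bert.predict(prompt, extra_embedding=audio_emb)
```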
♻ ☆ Focus On This, Not That! Steering LLMs With Adaptive Feature Specification
Despite the success of Instruction Tuning (IT) in training large language
models (LLMs) to perform arbitrary user-specified tasks, these models often
still leverage spurious or biased features learned from their training data,
leading to undesired behaviours when deploying them in new contexts. In this
work, we introduce Focus Instruction Tuning (FIT), which trains LLMs to
condition their responses by focusing on specific features whilst ignoring
others, leading to different behaviours based on what features are specified.
Across several experimental settings, we show that focus-tuned models can be
adaptively steered by focusing on different features at inference-time: for
instance, robustness can be improved by focusing on task-causal features and
ignoring spurious features, and social bias can be mitigated by ignoring
demographic categories. Furthermore, FIT can steer behaviour in new contexts,
generalising under distribution shift and to new unseen features at inference
time, and thereby facilitating more robust, fair, and controllable LLM
applications in real-world environments.
♻ ☆ Focus On This, Not That! Steering LLMs With Adaptive Feature
Specification
comment: 28 pages, 14 figures
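As a rough illustration of inference-time feature specification, the sketch below builds a prompt that names features to focus on and features to ignore. The template wording is our assumption; the paper's actual focus-instruction format may differ.

```python
def focus_prompt(task, focus, ignore, x):
    # Name the features to condition on and the features to disregard.
    return (
        f"{task}\n"
        f"Focus on: {', '.join(focus)}.\n"
        f"Ignore: {', '.join(ignore)}.\n"
        f"Input: {x}\nAnswer:"
    )

print(focus_prompt(
    task="Classify the sentiment of the review.",
    focus=["the reviewer's opinion of the product"],
    ignore=["the reviewer's stated demographic group"],
    x="As a lifelong gamer from Ohio, I found the controller flimsy.",
))
```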
♻ ☆ Leveraging Fine-Tuned Retrieval-Augmented Generation with Long-Context Support: For 3GPP Standards
Recent studies show that large language models (LLMs) struggle with technical
standards in telecommunications. We propose a fine-tuned retrieval-augmented
generation (RAG) system based on the Phi-2 small language model (SLM) to serve
as an oracle for communication networks. Our developed system leverages
forward-looking semantic chunking to adaptively determine parsing breakpoints
based on embedding similarity, enabling effective processing of diverse
document formats. To handle the challenge of multiple similar contexts in
technical standards, we employ a re-ranking algorithm to prioritize the most
relevant retrieved chunks. Recognizing the limitations of Phi-2's small context
window, we implement a recent technique, namely SelfExtend, to expand the
context window during inference, which not only boosts performance but also
accommodates a wider range of user queries and design requirements, from
customers to specialized technicians. For fine-tuning, we utilize the low-rank
adaptation (LoRA) technique to enhance computational efficiency during training
and enable effective fine-tuning on small datasets. Our comprehensive
experiments demonstrate substantial improvements over existing
question-answering approaches in the telecom domain, achieving performance that
exceeds larger language models such as GPT-4 (which is about 880 times larger
in size). This work presents a novel approach to leveraging SLMs for
communication networks, offering a balance of efficiency and performance. This
work can serve as a foundation towards agentic language models for networks.
♻ ☆ Leveraging Fine-Tuned Retrieval-Augmented Generation with Long-Context
Support: For 3GPP Standards
comment: submitted to Proc. IEEE Globecom
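The semantic-chunking step can be sketched directly: embed consecutive sentences and open a new chunk wherever similarity to the next sentence drops below a threshold. The embedding model and threshold below are illustrative choices, not the paper's configuration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences, threshold=0.6):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedder
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(len(sentences) - 1):
        # Forward-looking breakpoint: similarity to the *next* sentence.
        if float(np.dot(emb[i], emb[i + 1])) < threshold:
            chunks.append(current)
            current = []
        current.append(sentences[i + 1])
    chunks.append(current)
    return chunks
```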
♻ ☆ RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems
Retrieval-Augmented Generation (RAG) has become a standard architectural
pattern for incorporating domain-specific knowledge into user-facing chat
applications powered by Large Language Models (LLMs). RAG systems are
characterized by (1) a document retriever that queries a domain-specific corpus
for context information relevant to an input query, and (2) an LLM that
generates a response based on the provided query and context. However,
comprehensive evaluation of RAG systems remains a challenge due to the lack of
unified evaluation criteria and annotated datasets. In response, we introduce
RAGBench: the first comprehensive, large-scale RAG benchmark dataset of 100k
examples. It covers five unique industry-specific domains and various RAG task
types. RAGBench examples are sourced from industry corpora such as user
manuals, making it particularly relevant for industry applications. Further, we
formalize the TRACe evaluation framework: a set of explainable and actionable
RAG evaluation metrics applicable across all RAG domains. We release the
labeled dataset at https://huggingface.co/datasets/rungalileo/ragbench.
RAGBench explainable labels facilitate holistic evaluation of RAG systems,
enabling actionable feedback for continuous improvement of production
applications. Through extensive benchmarking, we find that LLM-based RAG
evaluation methods struggle to compete with a finetuned RoBERTa model on the
RAG evaluation task. We identify areas where existing approaches fall short and
propose the adoption of RAGBench with TRACe towards advancing the state of RAG
evaluation systems.
♻ ☆ RAGBench: Explainable Benchmark for Retrieval-Augmented Generation
Systems
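A hedged usage sketch for the released dataset follows; `load_dataset` is the standard Hugging Face loader, but the subset name passed below is an assumption, since the abstract does not enumerate the configuration names.

```python
from datasets import load_dataset

# Subset name below is assumed for illustration; check the dataset card for
# the actual configuration names.
ds = load_dataset("rungalileo/ragbench", "hotpotqa")
print(ds["train"][0].keys())  # inspect the question/context/response/label fields
```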
♻ ☆ SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words NeurIPS 2024
Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu
Speech encompasses a wealth of information, including but not limited to
content, paralinguistic, and environmental information. This comprehensive
nature of speech significantly impacts communication and is crucial for
human-computer interaction. Chat-Oriented Large Language Models (LLMs), known
for their general-purpose assistance capabilities, have evolved to handle
multi-modal inputs, including speech. Although these models can be adept at
recognizing and analyzing speech, they often fall short of generating
appropriate responses. We argue that this is due to the lack of principles on
task definition and model development, which requires open-source datasets and
metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a
benchmark dataset aimed at multidimensional evaluation of spoken dialogue
understanding and generation. SD-Eval focuses on paralinguistic and
environmental information and includes 7,303 utterances, amounting to 8.76
hours of speech data. The data is aggregated from eight public datasets,
representing four perspectives: emotion, accent, age, and background sound. To
assess the SD-Eval benchmark dataset, we implement three different models and
construct a training set following a process similar to that of SD-Eval. The
training set contains 1,052.72 hours of speech data and 724.4k utterances. We
also conduct a comprehensive evaluation using objective evaluation methods
(e.g., BLEU and ROUGE), subjective evaluations, and LLM-based metrics for the
generated responses. Models conditioned with paralinguistic and environmental
information outperform their counterparts in both objective and subjective
measures. Moreover, experiments demonstrate that LLM-based metrics show a
higher correlation with human evaluation compared to traditional metrics. We
open-source SD-Eval at https://github.com/amphionspace/SD-Eval.
♻ ☆ SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond
Words NeurIPS 2024
comment: Accepted to NeurIPS 2024
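The objective-evaluation side mentioned above can be sketched in a few lines; the library choices (sacrebleu and rouge-score) are ours, and the sentences are invented examples rather than SD-Eval data.

```python
import sacrebleu
from rouge_score import rouge_scorer

hyp = "Sorry to hear that; take a deep breath and tell me what happened."
ref = "I'm sorry you're upset. Breathe, then tell me what happened."

bleu = sacrebleu.corpus_bleu([hyp], [[ref]])
rouge = rouge_scorer.RougeScorer(["rougeL"]).score(ref, hyp)
print(f"BLEU={bleu.score:.1f}  ROUGE-L={rouge['rougeL'].fmeasure:.2f}")
```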
♻ ☆ Discriminative Representation learning via Attention-Enhanced Contrastive Learning for Short Text Clustering
Contrastive learning has gained significant attention in short text
clustering, yet it has an inherent drawback of mistakenly identifying samples
from the same category as negatives and then separating them in the feature
space (false negative separation), which hinders the generation of superior
representations. To generate more discriminative representations for efficient
clustering, we propose a novel short text clustering method called
Discriminative Representation learning via Attention-Enhanced Contrastive
Learning for Short Text Clustering (AECL). AECL consists of two modules: a
pseudo-label generation module and a contrastive learning module. Both modules
build a sample-level attention mechanism to capture similarity relationships
between samples and aggregate cross-sample features to generate consistent
representations. The former module then uses the more discriminative
consistent representations to produce reliable supervision information to
assist clustering, while the latter module uses the similarity relationships
and consistent representations to optimize the construction of positive
samples and perform similarity-guided contrastive learning, effectively
addressing the false negative separation issue. Experimental results
demonstrate that the proposed AECL outperforms state-of-the-art methods. If
the paper is accepted, we will open-source the code.
♻ ☆ Discriminative Representation learning via Attention-Enhanced
Contrastive Learning for Short Text Clustering
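The sample-level attention idea can be pictured with a toy numpy sketch: each sample's consistent representation is an attention-weighted aggregation of features across the batch. Dimensions and temperature are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 16))                       # batch of sample features
Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # L2-normalise

sim = Z @ Z.T / 0.1                                # scaled pairwise similarities
attn = np.exp(sim - sim.max(axis=1, keepdims=True))
attn = attn / attn.sum(axis=1, keepdims=True)      # row-wise softmax attention

consistent = attn @ Z  # cross-sample aggregation -> consistent representations
```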
♻ ☆ Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration
Recent advancements in language models (LMs) have sparked growing interest in
developing LM agents. While fully autonomous agents could excel in many
scenarios, numerous use cases inherently require them to collaborate with
humans due to humans' latent preferences, domain expertise, or need for
control. To facilitate the study of human-agent collaboration, we present
Collaborative Gym (Co-Gym), a general framework enabling asynchronous,
tripartite interaction among agents, humans, and task environments. We
instantiate Co-Gym with three representative tasks in both simulated and
real-world conditions, and propose an evaluation framework that assesses both
the collaboration outcomes and processes. Our findings reveal that
collaborative agents consistently outperform their fully autonomous
counterparts in task performance in the cases they delivered, achieving win
rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related
Work when evaluated by real users. However, our study also highlights
significant challenges in developing collaborative agents, requiring
advancements in core aspects of intelligence -- communication capabilities,
situational awareness, and balancing autonomy and human control.
♻ ☆ Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent
Collaboration
comment: Preprint. Work in progress
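The asynchronous, tripartite interaction the framework enables might look roughly like the asyncio sketch below, where agent, human, and environment all post events to a shared queue. The actors and message schema are illustrative stand-ins, not Co-Gym's API.

```python
import asyncio

async def agent(q):
    await q.put(("agent", "propose: draft itinerary"))

async def human(q):
    await q.put(("human", "edit: prefer trains over flights"))

async def env(q):
    await q.put(("env", "observe: train schedule fetched"))

async def main():
    q = asyncio.Queue()  # shared event stream among the three parties
    await asyncio.gather(agent(q), human(q), env(q))
    while not q.empty():
        sender, event = await q.get()
        print(f"[{sender}] {event}")

asyncio.run(main())
```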
♻ ☆ MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish
Multilingual large language models (MLLMs) have shown impressive capabilities
across a variety of languages. However, efficacy can differ greatly between
different language families, especially for those with limited linguistic
resources. This report presents MERaLiON-TextLLM, a series of open-source
language models specifically tailored to improve understanding and generation
in Chinese, Indonesian, Malay, and Singlish. The initially released model is
built on Llama-3-8B-Base and refined through a meticulously crafted process of
continued pre-training and weight merging. Our approach achieves performance
improvements across benchmarks in these languages, exceeding the capabilities
of the official Llama-3 models. We provide the model checkpoints as a resource
to support further research and development in cross-lingual language
understanding.
♻ ☆ MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models
in Chinese, Indonesian, Malay, and Singlish
♻ ☆ Do LLMs Really Think Step-by-step In Implicit Reasoning?
It is well known that Chain-of-Thought (CoT) prompting can remarkably enhance
LLMs' performance on complex tasks. However, because it also introduces slower
inference speeds and higher computational costs, many studies have attempted
to use implicit CoT, which does not require LLMs to explicitly generate the
intermediate steps. However, the invisible reasoning process leaves open the
question of whether implicit CoT can really equal explicit CoT. Therefore, in
this study, we address this question through experiments. We probe the
information of intermediate steps from the model's hidden states when it is
either trained or prompted to perform implicit CoT. The results surprisingly
indicate that when prompted, LLMs hardly think about intermediate steps,
suggesting they may just rely on experience rather than strict step-by-step
reasoning. When trained, however, they indeed calculate intermediate steps.
Moreover, in both situations, we find that the effect of using implicit CoT is
susceptible to the format of the problem, reaffirming the current deficiency
of implicit CoT.
♻ ☆ Do LLMs Really Think Step-by-step In Implicit Reasoning?
comment: The code is in
https://github.com/yuyijiong/if_step_by_step_implicit_CoT
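The probing methodology can be sketched as fitting a linear probe from a layer's hidden states to an intermediate result (for example, a partial sum in multi-step arithmetic). The synthetic features below stand in for real model activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
H = rng.normal(size=(1000, 256))    # hidden states at a chosen layer (placeholder)
y = rng.integers(0, 10, size=1000)  # label: the intermediate-step value

H_tr, H_te, y_tr, y_te = train_test_split(H, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)
# Accuracy far above chance (0.1 here) would suggest the model actually
# computes the intermediate step; chance-level accuracy suggests it does not.
print(f"probe accuracy: {probe.score(H_te, y_te):.2f}")
```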
♻ ☆ MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models
We introduce MERaLiON-AudioLLM (Multimodal Empathetic Reasoning and Learning
in One Network), the first speech-text model tailored for Singapore's
multilingual and multicultural landscape. Developed under the National Large
Language Models Funding Initiative, Singapore, MERaLiON-AudioLLM integrates
advanced speech and text processing to address the diverse linguistic nuances
of local accents and dialects, enhancing accessibility and usability in
complex, multilingual environments. Our results demonstrate improvements in
both speech recognition and task-specific understanding, positioning
MERaLiON-AudioLLM as a pioneering solution for region-specific AI applications.
We envision this release to set a precedent for future models designed to
address localised linguistic and cultural contexts in a global framework.
♻ ☆ MERaLiON-AudioLLM: Bridging Audio and Language with Large Language
Models
comment: https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION
♻ ☆ CrisisSense-LLM: Instruction Fine-Tuned Large Language Model for Multi-label Social Media Text Classification in Disaster Informatics
In the field of crisis/disaster informatics, social media is increasingly
being used for improving situational awareness to inform response and relief
efforts. Efficient and accurate text classification tools have been a focal
area of investigation in crisis informatics. However, current methods mostly
rely on single-label text classification models, which fail to capture
different insights embedded in dynamic and multifaceted disaster-related social
media data. This study introduces a novel approach to disaster text
classification by enhancing a pre-trained Large Language Model (LLM) through
instruction fine-tuning targeted for multi-label classification of
disaster-related tweets. Our methodology involves creating a comprehensive
instruction dataset from disaster-related tweets, which is then used to
fine-tune an open-source LLM, thereby embedding it with disaster-specific
knowledge. This fine-tuned model can classify multiple aspects of
disaster-related information simultaneously, such as the type of event,
informativeness, and involvement of human aid, significantly improving the
utility of social media data for situational awareness in disasters. The
results demonstrate that this approach enhances the categorization of critical
information from social media posts, thereby facilitating more effective
deployment for situational awareness during emergencies. This research paves
the way for more advanced, adaptable, and robust disaster management tools,
leveraging the capabilities of LLMs to improve real-time situational awareness
and response strategies in disaster scenarios.
♻ ☆ CrisisSense-LLM: Instruction Fine-Tuned Large Language Model for
Multi-label Social Media Text Classification in Disaster Informatics
comment: Relevant source code and data is available:
https://github.com/KaiYin97/CrsisLLM
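One disaster tweet might be converted into a multi-label instruction-tuning example roughly as sketched below, using the aspects named in the abstract. The JSON layout is our assumption, not the authors' released schema.

```python
import json

tweet = "Flooding on Main St, families need water and blankets near the shelter."
example = {
    "instruction": "Label the tweet for: event_type, informativeness, "
                   "human_aid_involved.",
    "input": tweet,
    "output": json.dumps({
        "event_type": "flood",
        "informativeness": "informative",
        "human_aid_involved": True,
    }),
}
print(json.dumps(example, indent=2))
```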
♻ ☆ A General Framework for Inference-time Scaling and Steering of Diffusion Models
Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, Rajesh Ranganath
Diffusion models produce impressive results in modalities ranging from images
and video to protein design and text. However, generating samples with
user-specified properties remains a challenge. Recent research proposes
fine-tuning models to maximize rewards that capture desired properties, but
these methods require expensive training and are prone to mode collapse. In
this work, we propose Feynman Kac (FK) steering, an inference-time framework
for steering diffusion models with reward functions. FK steering works by
sampling a system of multiple interacting diffusion processes, called
particles, and resampling particles at intermediate steps based on scores
computed using functions called potentials. Potentials are defined using
rewards for intermediate states and are selected such that a high value
indicates that the particle will yield a high-reward sample. We explore various
choices of potentials, intermediate rewards, and samplers. We evaluate FK
steering on text-to-image and text diffusion models. For steering text-to-image
models with a human preference reward, we find that FK steering a 0.8B
parameter model outperforms a 2.6B parameter fine-tuned model on prompt
fidelity, with faster sampling and no training. For steering text diffusion
models with rewards for text quality and specific text attributes, we find that
FK steering generates lower perplexity, more linguistically acceptable outputs
and enables gradient-free control of attributes like toxicity. Our results
demonstrate that inference-time scaling and steering of diffusion models, even
with off-the-shelf rewards, can provide significant sample quality gains and
controllability benefits. Code is available at
https://github.com/zacharyhorvitz/Fk-Diffusion-Steering.
♻ ☆ A General Framework for Inference-time Scaling and Steering of Diffusion
Models
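The particle-resampling mechanics can be illustrated with a toy numpy sketch: run several denoising chains in parallel and, at intermediate steps, resample particles in proportion to a potential built from intermediate rewards. The denoising step and reward below are placeholders for real model and reward calls.

```python
import numpy as np

rng = np.random.default_rng(0)
n_particles, dim, steps, lam = 8, 4, 5, 5.0

def denoise_step(x):
    # Stand-in for one reverse-diffusion step of a real model.
    return 0.9 * x + 0.1 * rng.normal(size=x.shape)

def reward(x):
    # Stand-in intermediate reward; here, prefer samples near the origin.
    return -np.sum(x**2, axis=1)

x = rng.normal(size=(n_particles, dim))
for _ in range(steps):
    x = denoise_step(x)
    r = reward(x)
    potentials = np.exp(lam * (r - r.max()))  # high value -> promising particle
    probs = potentials / potentials.sum()
    idx = rng.choice(n_particles, size=n_particles, p=probs)
    x = x[idx]                                # resample the particle system
```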
♻ ☆ Surveying Attitudinal Alignment Between Large Language Models Vs. Humans Towards 17 Sustainable Development Goals
Qingyang Wu, Ying Xu, Tingsong Xiao, Yunze Xiao, Yitong Li, Tianyang Wang, Yichi Zhang, Shanghai Zhong, Yuwei Zhang, Wei Lu, Yifan Yang
Large Language Models (LLMs) have emerged as potent tools for advancing the
United Nations' Sustainable Development Goals (SDGs). However, the attitudinal
disparities between LLMs and humans towards these goals can pose significant
challenges. This study conducts a comprehensive review and analysis of the
existing literature on the attitudes of LLMs towards the 17 SDGs, emphasizing
the comparison between their attitudes and support for each goal and those of
humans. We examine the potential disparities, primarily focusing on aspects
such as understanding and emotions, cultural and regional differences, task
objective variations, and factors considered in the decision-making process.
These disparities arise from the underrepresentation and imbalance in LLM
training data, historical biases, quality issues, lack of contextual
understanding, and the skewed ethical values these data reflect. The study
also investigates
the risks and harms that may arise from neglecting the attitudes of LLMs
towards the SDGs, including the exacerbation of social inequalities, racial
discrimination, environmental destruction, and resource wastage. To address
these challenges, we propose strategies and recommendations to guide and
regulate the application of LLMs, ensuring their alignment with the principles
and goals of the SDGs, and therefore creating a more just, inclusive, and
sustainable future.
♻ ☆ Surveying Attitudinal Alignment Between Large Language Models Vs. Humans
Towards 17 Sustainable Development Goals
♻ ☆ PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging
Multimodal large language models (MLLMs) represent an evolutionary expansion
in the capabilities of traditional large language models, enabling them to
tackle challenges that surpass the scope of purely text-based applications.
They leverage the knowledge previously encoded within these language models,
thereby enhancing their applicability and functionality in the realm of
multimodal contexts. Recent works investigate the adaptation of MLLMs as a
universal solution to address medical multi-modal problems as a generative
task. In this paper, we propose a parameter efficient framework for fine-tuning
MLLMs, specifically validated on medical visual question answering (Med-VQA)
and medical report generation (MRG) tasks, using public benchmark datasets. We
also introduce an evaluation metric using the 5-point Likert scale and its
weighted average value to measure the quality of the generated reports for MRG
tasks, where the scale ratings are labelled both manually by humans and by the
GPT-4 model. We further assess the consistency of performance metrics across
traditional measures, GPT-4, and human ratings for both VQA and MRG tasks. The
results indicate that semantic similarity assessments using GPT-4 align closely
with human annotators and provide greater stability, yet they reveal a
discrepancy when compared to conventional lexical similarity measurements. This
questions the reliability of lexical similarity metrics for evaluating the
performance of generative models in Med-VQA and report generation tasks.
Moreover, our fine-tuned model significantly outperforms GPT-4v. This indicates
that without additional fine-tuning, multi-modal models like GPT-4v do not
perform effectively on medical imaging tasks. The code will be available here:
https://github.com/jinlHe/PeFoMed.
♻ ☆ PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language
Models for Medical Imaging
comment: 12 pages, 8 figures, 12 tables
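The Likert-based report metric might be computed roughly as below: map 5-point ratings to weights and average over rated reports. The specific weight mapping is illustrative, since the abstract does not state it.

```python
def weighted_likert(ratings):
    # Map 1..5 ratings to a 0..1 quality score, then average over reports.
    weights = {1: 0.0, 2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0}
    return sum(weights[r] for r in ratings) / len(ratings)

print(weighted_likert([5, 4, 4, 3, 5]))  # -> 0.8
```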
♻ ☆ BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages NeurIPS 2024
Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla Perez-Almendros, Abinew Ali Ayele, Víctor Gutiérrez-Basulto, Yazmín Ibáñez-García, Hwaran Lee, Shamsuddeen Hassan Muhammad, Kiwoong Park, Anar Sabuhi Rzayev, Nina White, Seid Muhie Yimam, Mohammad Taher Pilehvar, Nedjma Ousidhoum, Jose Camacho-Collados, Alice Oh
Large language models (LLMs) often lack culture-specific knowledge of daily
life, especially across diverse regions and non-English languages. Existing
benchmarks for evaluating LLMs' cultural sensitivities are limited to a single
language or collected from online sources such as Wikipedia, which do not
reflect the mundane everyday lifestyles of diverse regions. That is,
information about the food people eat for their birthday celebrations, spices
they typically use, musical instruments youngsters play, or the sports they
practice in school is common cultural knowledge but uncommon in easily
collected online sources, especially for underrepresented cultures. To address
this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate
LLMs' everyday knowledge across diverse cultures and languages. BLEnD comprises
52.6k question-answer pairs from 16 countries/regions, in 13 different
languages, including low-resource ones such as Amharic, Assamese, Azerbaijani,
Hausa, and Sundanese. We construct the benchmark to include two formats of
questions: short-answer and multiple-choice. We show that LLMs perform better
for cultures that are highly represented online, with a maximum 57.34%
difference in GPT-4, the best-performing model, in the short-answer format. For
cultures represented by mid-to-high-resource languages, LLMs perform better in
their local languages, but for cultures represented by low-resource languages,
LLMs perform better in English than the local languages. We make our dataset
publicly available at: https://github.com/nlee0212/BLEnD.
♻ ☆ BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures
and Languages NeurIPS 2024
comment: Accepted to NeurIPS 2024 Datasets & Benchmark Track